A Hybrid Multicast-Unicast Infrastructure for Efficient 
Publish-Subscribe in Enterprise Networks 



Danny Bickson, Ezra N. Hoch, Nir Naaman, Yoav Tock
IBM Haifa Research Lab,
Mount Carmel, Haifa 31905, Israel
{dannybi,ezrah,naaman,tock}@il.ibm.com



ABSTRACT 

One of the main challenges in building a large-scale
publish-subscribe infrastructure in an enterprise network is to
provide the subscribers with the required information while
minimizing the consumed host and network resources. Typically,
previous approaches utilize either IP multicast or point-to-point
unicast for efficient dissemination of the information.

In this work, we propose a novel hybrid framework, which is a 
combination of both multicast and unicast data dissemination. Our 
hybrid framework allows us to take the advantages of both multi- 
cast and unicast, while avoiding their drawbacks. We investigate 
several algorithms for computing the best mapping of publishers' 
transmissions into multicast and unicast transport. 

Using extensive simulations, we show that our hybrid framework
reduces consumed host and network resources, outperforming
traditional solutions. To ensure that the subscribers' interests
closely resemble those of real-world settings, our simulations are
based on stock market data and on recorded IBM WebSphere
subscriptions.

Categories and Subject Descriptors 

C.2.1 [Computer Communication Networks]: Network Architecture
and Design

General Terms 

Performance 

Keywords 

IP-Multicast, publish-subscribe 

1. INTRODUCTION 

Consider a large-scale publish-subscribe application that is
characterized by a large number of information flows, as well as a
large number of subscribers. Each information flow generates
messages which must be delivered to an interested subset of
subscribers. Subscribers are interested in different, yet possibly
overlapping, subsets of the information flows. Naturally, an
individual information flow may be required by many subscribers. A
typical example is a financial market data dissemination system,
where the flows can be stock quotes (of which there are tens of
thousands), commodity prices, etc., and subscribers are traders,
analysts and so on (in hundreds). Each subscriber is interested in a
different portfolio.

One of the two common approaches in the above dissemination
scenario is to utilize IP multicast to transmit the data. In this
work we assume that IP multicast service is supported in the
enterprise network. Given that overlaps between subscribers'
interests are not rare, transmission costs can be reduced by
grouping information flows into groups, and using multicast to
disseminate these flows to subscribers. This mechanism requires two
mappings: one between flows and groups (mapping a flow to one or
more multicast groups), and another between users and multicast
groups (such that each subscriber gets all the information she is
interested in). The problem of finding the mappings which minimize
the consumption of network resources is termed "the channelization
problem" [Adler et al.(2001)].

Using multicast as a means of dissemination has some limitations.
Typically, only a limited number of multicast addresses can be
used, due to the memory and computational overhead they impose on
the network devices.

In our setting, the number of flows is much larger than the number
of available multicast groups, which means that a one-to-one
mapping of flows to multicast groups is not possible. Thus,
different flows have to be batched into the same multicast group.
As a result, subscribers may receive data they are not interested in
and which they must filter. As shown in [Carmeli et al.(2004)],
filtering in the end hosts is one of the reasons for reduced
performance in a high-bandwidth enterprise network.

A second common approach is to use point-to-point connections, 
where each publisher transmits the information required using uni- 
cast. The main drawback of solely using unicast is the poor utiliza- 
tion of network resources when many subscribers are interested in 
the same data flow. In this case, the transmitter has to transmit the 
same data many times to different users, which results in a waste of 
transmitter resources like bandwidth, CPU and memory as well as 
wasted network bandwidth. 

In the current paper, we propose a novel hybrid approach, which 
combines both unicast and multicast transports. In our approach, 
we allow a flexible allocation of flows into unicast and multicast 
connections, avoiding the inherent drawbacks of using a single scheme. 
Thus, we gain high efficiency when many subscribers are interested 
in the same data flow by utilizing multicast, and use unicast to re- 
duce unneeded filtering, whenever the subscription to certain flows 
is relatively rare. 

We define the hybrid unicast-multicast problem as an optimization
problem, and explore several heuristics to solve it. Using
extensive simulations, we compare different approaches for solving
the hybrid problem and identify which perform best under different
scenarios. As an additional contribution, we explore different
algorithms for solving the related channelization problem, which
is NP-hard, and identify a single algorithm which outperforms the
others.

The paper is organized as follows. Section 2 overviews the related
work and explains the novelty of our hybrid approach. Section 3
describes the problem model and formally defines the hybrid
problem, showing that it is NP-Hard. Section 4 presents our
proposed heuristics for solving the hybrid problem. Section 5 gives
extensive experimental results which compare the different
heuristics and their operation under various real-world scenarios.
We conclude in Section 6.

2. RELATED WORK 

Publish-subscribe systems have been the target of extensive
research in recent years. Research has focused on the problem of
disseminating data efficiently to interested users. Two main
paradigms were proposed: content-based multicast and subject-based
multicast ([Levine et al.(2000)], [Ganguly et al.(2006)],
[Lety et al.(2004)]). Extensions to these paradigms include
[Zhang and Hu(2005)], where a hybrid approach for content-based and
subject-based dissemination is proposed. Another example is
[Cao and Singh(2005)], which proposes a solution for a setting in
which dynamic changes of the multicast groups are required. In
[Opyrchal et al.(2000)], content-based dissemination is implemented
using IP multicast.

One of the main challenges when considering subject-based multicast
is solving the channelization problem ([Adler et al.(2001)],
[Wong et al.(1999)], [Cevher et al.(2008)], [Tock et al.(2005)]).
Previous approaches map flows into multicast groups, while the
current paper allows for both multicast and unicast transmissions.
In Section 5 we empirically compare several algorithms for solving
the channelization problem, identifying a single algorithm which
outperforms the others.

A closely related work to ours is Dr. Multicast
[Vigfusson et al.(2008)], which proposes to use unicast as well as
multicast in a data center information dissemination scenario.
However, [Vigfusson et al.(2008)] focuses on the management and
stability issues of IP multicast in the data center, and does not
explicitly explore the question of mapping flows into multicast and
unicast in a quantitative manner. To the best of our knowledge,
ours is the first work which formally defines the problem as an
optimization problem and explores several heuristics to solve it.

3. MODEL AND PROBLEM DEFINITION

We use the following notation, as in [Adler et al.(2001)].

• Let m be the number of users.

• Let n be the number of flows.

• Let k be the number of multicast groups.

• Let $W_{n \times m}$ denote the interest matrix:

$$W_{i,j} = \begin{cases} 1 & \text{user } j \text{ is interested in flow } i \\ 0 & \text{otherwise} \end{cases}$$

• Let $X_{n \times k}$ denote the mapping from flows to multicast groups:

$$X_{i,j} = \begin{cases} 1 & \text{flow } i \text{ is mapped to multicast group } j \\ 0 & \text{otherwise} \end{cases}$$

• Let $Y_{k \times m}$ denote the mapping from multicast groups to users:

$$Y_{i,j} = \begin{cases} 1 & \text{user } j \text{ receives multicast group } i \\ 0 & \text{otherwise} \end{cases}$$

• Let $T_{n \times m}$ denote the unicast matrix:

$$T_{i,j} = \begin{cases} 1 & \text{flow } i \text{ is sent to user } j \text{ using unicast} \\ 0 & \text{otherwise} \end{cases}$$

• Let $\lambda_{1 \times n}$ denote the rate of the flows, where $\lambda_i$ is the
rate of flow i.

3.1 The Channelization Problem

Given m users, n flows, k multicast groups, a vector of flow rates
$\lambda$ and an interest matrix W, the channelization problem
[Adler et al.(2001)] aims at finding two mapping matrices X, Y that
minimize the cost of transmission (using only multicast groups),
under the constraint that each user receives all the flows it is
interested in. A schematic diagram of the channelization mappings
is given in Figure 1.

Figure 1: Schematic picture of the channelization mapping.

To formally define the cost function, let $w_1, w_2$ be real positive
numbers. The cost function is:

$$C(X,Y) = w_1 \sum_{i=1}^{n} \sum_{j=1}^{k} \sum_{h=1}^{m} X_{i,j} Y_{j,h} \lambda_i + w_2 \sum_{i=1}^{n} \sum_{j=1}^{k} X_{i,j} \lambda_i .$$

The cost consists of two terms. The first sums all transmissions
received by subscribers: for each user h, it sums the number of
times h receives any given flow i, times the rate $\lambda_i$ of flow i.
The second term sums the transmissions of the senders; that is,
each flow i is counted according to the number of multicast groups
it is transmitted to, times the rate $\lambda_i$. The factors $w_1, w_2$
weight the two terms. The channelization problem is defined as:

$$\min_{X,Y} C(X,Y) \quad \text{s.t.} \quad XY \ge W .$$

In other words, given a set of users U, a set of multicast groups M,
a set of flows F, an interest matrix W and a flow-rate vector $\lambda$,
find X, Y that minimize C(X, Y) under the constraint that $XY \ge W$.
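As an illustration, the cost function and the constraint can be evaluated directly. The following Python sketch assumes matrices stored as nested lists and $w_1 = w_2 = 1$ by default; the function names and the toy instance are hypothetical and are not taken from the evaluated systems.

```python
def channelization_cost(X, Y, rates, w1=1.0, w2=1.0):
    """C(X, Y): w1 * (received traffic) + w2 * (sent traffic)."""
    n, k, m = len(X), len(Y), len(Y[0])
    # Each user h receives flow i once per group j that carries i and that h joined.
    received = sum(X[i][j] * Y[j][h] * rates[i]
                   for i in range(n) for j in range(k) for h in range(m))
    # Each flow i is sent once per group it is mapped to.
    sent = sum(X[i][j] * rates[i] for i in range(n) for j in range(k))
    return w1 * received + w2 * sent


def is_feasible(X, Y, W):
    """Check XY >= W entrywise: every interested user gets the flow via some group."""
    n, k, m = len(X), len(Y), len(Y[0])
    return all(sum(X[i][j] * Y[j][h] for j in range(k)) >= W[i][h]
               for i in range(n) for h in range(m))


# Toy instance (hypothetical): 3 flows, 2 multicast groups, 2 users.
W = [[1, 0], [1, 1], [0, 1]]   # W[i][h] = 1 if user h wants flow i
X = [[1, 0], [1, 0], [0, 1]]   # flows 0 and 1 share group 0; flow 2 uses group 1
Y = [[1, 1], [0, 1]]           # user 0 joins group 0; user 1 joins both groups
rates = [1.0, 2.0, 4.0]

print(is_feasible(X, Y, W), channelization_cost(X, Y, rates))
```

In this instance user 1 receives flow 0 without having requested it, which is the filtering overhead that the first cost term accounts for.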



3.2 The Hybrid Channelization Problem 

Below we model our hybrid framework as an optimization problem.
Unlike the original channelization problem, the transmitters may
also send flows using unicast. That is, any flow i can be
transmitted using unicast to any user h. In the hybrid problem the
cost function C' is composed of three terms:

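$$\begin{aligned}
C'(X,Y,T) = {} & w'_1 \sum_{i=1}^{n} \sum_{j=1}^{k} \sum_{h=1}^{m} X_{i,j} Y_{j,h} \lambda_i \\
 & + w'_2 \sum_{i=1}^{n} \sum_{j=1}^{k} X_{i,j} \lambda_i \\
 & + w'_3 (w'_1 + w'_2) \sum_{i=1}^{n} \sum_{h=1}^{m} T_{i,h} \lambda_i \qquad (1)
\end{aligned}$$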

The additional term represents the cost of all the flow i-user h
pairs such that flow i is sent using unicast to user h, multiplied
by the cost of transmission. The cost of transmission of a flow
consists of the cost of sending the flow ($w'_2$), the cost of
receiving the flow ($w'_1$), and the cost incurred by the unicast
mechanism, $w'_3$ (additional memory requirements, etc.). For the rest
of the paper, we assume that the transmitting and receiving costs
are equal ($w'_1 = w'_2$) and that the unicast cost equals their sum
(i.e., $w'_3 = 1$).

Using the cost function C'(X, Y, T), the hybrid channelization
problem can be formally defined. Given m users, k multicast groups,
n flows, an interest matrix W and a flow-rate vector $\lambda$, the
hybrid channelization problem is defined as:

$$\min_{X,Y,T} C'(X,Y,T) \quad \text{s.t.} \quad XY + T \ge W .$$

The constraint $XY + T \ge W$ requires that each user h requesting
flow i will receive i either via unicast or via a multicast group
that h listens to. (h may receive flow i via both multicast and
unicast; however, that would be wasteful.)
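To make the three-term cost concrete, the following Python sketch evaluates C' and the constraint on a small instance. It assumes nested-list matrices and the weights $w'_1 = w'_2 = w'_3 = 1$ used in the rest of the paper; the function names and the toy data are hypothetical.

```python
def hybrid_cost(X, Y, T, rates, w1=1.0, w2=1.0, w3=1.0):
    """C'(X, Y, T) as in Eq. (1): multicast receive + multicast send + unicast."""
    n, k, m = len(X), len(Y), len(T[0])
    received = sum(X[i][j] * Y[j][h] * rates[i]
                   for i in range(n) for j in range(k) for h in range(m))
    sent = sum(X[i][j] * rates[i] for i in range(n) for j in range(k))
    unicast = sum(T[i][h] * rates[i] for i in range(n) for h in range(m))
    return w1 * received + w2 * sent + w3 * (w1 + w2) * unicast


def hybrid_feasible(X, Y, T, W):
    """Check XY + T >= W entrywise."""
    n, k, m = len(X), len(Y), len(T[0])
    return all(sum(X[i][j] * Y[j][h] for j in range(k)) + T[i][h] >= W[i][h]
               for i in range(n) for h in range(m))


# Toy instance (hypothetical): 3 flows, 1 multicast group, 2 users.
W = [[1, 0], [1, 1], [0, 1]]
rates = [1.0, 2.0, 4.0]
Y = [[1, 1]]                        # both users join the single group

X_mc = [[1], [1], [1]]              # multicast only: all flows in the group
T_zero = [[0, 0], [0, 0], [0, 0]]

X_hy = [[0], [1], [0]]              # hybrid: only the shared flow 1 is multicast
T_hy = [[1, 0], [0, 0], [0, 1]]     # flows 0 and 2 go by unicast to their one user

print(hybrid_cost(X_mc, Y, T_zero, rates), hybrid_cost(X_hy, Y, T_hy, rates))
```

On this instance the multicast-only mapping costs 21 while the hybrid mapping costs 16, because the two single-subscriber flows no longer have to be filtered by the user who did not request them.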

3.3 Intractability of the Hybrid Channeliza- 
tion Problem 

Theorem 1. The hybrid channelization problem is NP-Hard.

Proof. In [Adler et al.(2001)] it was shown that the non-unicast
channelization problem is NP-Hard. It therefore suffices to reduce
the non-unicast channelization problem to the hybrid
channelization problem. The reduction is simple: given
$n, m, k, W, w_1, w_2, \lambda$ as input to the channelization problem,
construct an input to the hybrid channelization problem which is
exactly the same, with a single modification. Setting $w'_3$ to be
larger than $C(1_{n \times k}, 1_{k \times m})$ ensures that any solution X, Y, T
does not have a lower cost than $X, Y, 0_{n \times m}$. Thus, the minimal
cost is the same as in the non-unicast setting. □

4. PROPOSED ALGORITHMS 

We propose the following two-step framework for solving the 
hybrid problem. In the first step, we solve the channelization prob- 
lem, without assigning any unicast flows. In the second step, we 
heuristically select some of the flows to be sent using unicast, and 
update the subscription matrix W accordingly. 

This process can be repeated several times, as long as the system
cost is reduced. Simulation results show, however, that repeating
the process does not significantly improve the system cost, while
incurring a high computational cost.

The above process can be viewed as starting from a solution that
uses only multicast, and then using unicast to greedily improve the
solution. Alternatively, one can start with a solution that uses
only unicast (i.e., T = W), and greedily improve it by moving flows
to multicast. Our simulations show that the first framework
performs better than the second, although both improve upon
previous non-hybrid solutions.

4.1 First Step: Solving the Channelization 
Problem 

Previous works [Tock et al.(2005)], [Adler et al.(2001)] discuss
several heuristics for solving the channelization problem. Adler et
al. examine several heuristics, among them random assignment and
user-based and flow-based merges. Tock et al. proposed a variant of
the K-Means algorithm which greedily minimizes the cost on each
iteration.

In this work, we extensively compare the different previous
approaches, while exploring new algorithms. We have utilized an
algorithm from the data mining domain, called Binary Matrix
Decomposition (BMD, [Li(2005a)], [Li(2005b)]), which was originally
used for two-sided binary clustering of documents and keywords into
document classes. The basic idea is that the global cost function
for minimization is:

$$\min \| XY - W \|_2^2 \quad \text{subject to} \quad X, Y \in \{0, 1\}$$

which means we are looking for a decomposition of the interest
matrix W into two binary matrices X, Y so that the Euclidean norm
of the difference between XY and W is minimized. An alternating
algorithm is derived by starting with an initial guess X, solving
for the Y which minimizes the cost function, and then continuing in
rounds. There are some drawbacks in using this algorithm: first, it
does not support variable flow rates. Second, it allows for some
flows to be missing. Despite those drawbacks it has reasonable
performance when operating on small systems (i.e., 200 flows, 10
multicast groups, 100 users). However, when operating on larger
systems (i.e., 10000 flows, 100 multicast groups, 250 users) it
does not improve upon a random selection of a solution. Therefore,
we have omitted the experimental results of the BMD algorithm from
the following graphs.
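The alternating idea can be illustrated with a minimal Python sketch. It is not the update rule of [Li(2005a)]: for illustration only, it assumes that k is tiny, so that each binary row of X or column of Y can be chosen by exhaustive search over all $2^k$ binary vectors. The function names and the toy instance are hypothetical.

```python
from itertools import product


def objective(X, Y, W):
    """Squared error ||XY - W||^2 summed over all entries."""
    n, k, m = len(X), len(Y), len(W[0])
    return sum((sum(X[i][j] * Y[j][h] for j in range(k)) - W[i][h]) ** 2
               for i in range(n) for h in range(m))


def alternate(W, X, rounds=5):
    """Alternate: best binary Y for fixed X, then best binary X for fixed Y."""
    n, m, k = len(W), len(W[0]), len(X[0])
    codes = list(product((0, 1), repeat=k))  # all 2^k binary vectors

    def col_err(y, h):  # error for user h if its group memberships are y
        return sum((sum(X[i][j] * y[j] for j in range(k)) - W[i][h]) ** 2
                   for i in range(n))

    def row_err(x, i):  # error for flow i if its group assignment is x
        return sum((sum(x[j] * Y[j][h] for j in range(k)) - W[i][h]) ** 2
                   for h in range(m))

    for _ in range(rounds):
        cols = [min(codes, key=lambda y: col_err(y, h)) for h in range(m)]
        Y = [[cols[h][j] for h in range(m)] for j in range(k)]
        X = [list(min(codes, key=lambda x: row_err(x, i))) for i in range(n)]
    return X, Y


# Toy instance (hypothetical): 3 flows, 2 users, k = 2, poor initial guess.
W = [[1, 0], [1, 0], [0, 1]]
X0 = [[1, 0], [0, 1], [0, 1]]
X_fit, Y_fit = alternate(W, X0)
print(objective(X_fit, Y_fit, W))
```

On this instance the procedure settles in a local minimum with one uncovered entry: user 1 never receives flow 2. Nothing in the objective enforces $XY \ge W$, which is the second drawback noted in the text.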

We have also utilized the Matlab™ K-Means algorithm [Seber(1984)],
[Spath(1985)], which is a two-phase algorithm. In the first phase
points are reassigned to their nearest cluster centroid, all at
once, followed by recalculation of cluster centroids. The second
phase uses "on-line" updates, where points are individually
reassigned while reducing the total cost function, and cluster
centroids are recomputed after each reassignment.

We further investigated an interior point algorithm. Starting from
the original problem formulation by Adler et al., the binary
mapping matrices X and Y are relaxed to be in the range (0, 1). The
constraints $X \ge 0$, $Y \ge 0$, $X \le 1$, $Y \le 1$ and $XY \ge W$ are
incorporated into the cost function using the log-barrier technique
[Boyd and Vandenberghe(2004)], and then the Newton method is
applied. After convergence, the solution is rounded to obtain
binary X and Y. Unfortunately, the interior point method did not
perform well in practice. One reason is that the problem is neither
convex nor concave. We usually obtained a good fractional solution,
but when the solution was rounded to the closest integer solution,
it did not compare favorably to the other algorithms. Therefore, we
have omitted the experimental results of the interior-point
algorithm from the following graphs.

Algorithm                                     Running time
Random assignment                             O(n + mnk)
K-means [Tock et al.(2005)]                   O(tmnk)
Matlab K-means [Seber(1984)], [Spath(1985)]   O(tmnk)
BMD [Li(2005a)], [Li(2005b)]                  O(tmnk)
Interior-point method                         (see text)

Table 1: Examined algorithms for solving the channelization
problem and their running time.

In total, we have examined five different algorithms for solving
the channelization problem. Table 1 summarizes the tested
algorithms. Regarding their running time, not surprisingly, the
random assignment is the lightest algorithm, with a running time of
n (setting each flow to a random multicast group) plus mnk for
going over all users and assigning them to groups such that they
will receive all required flows. The two K-means variants as well
as the BMD algorithm have a similar running time, where t is the
number of iterations (typically five on problem sizes of
thousands), since for each flow they go over all possible
assignments to groups and take the one with minimal cost. The
running time of the interior point method is dominated by the
Hessian inversion in the Newton step.
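The two phases of the random assignment, and where the n and mnk terms come from, can be sketched as follows. This is a minimal Python illustration under the assumption of nested-list matrices; the function name, the seed parameter and the toy instance are hypothetical.

```python
import random


def random_assignment(W, k, seed=0):
    """Map each flow to one random group, then subscribe users as needed."""
    rng = random.Random(seed)
    n, m = len(W), len(W[0])

    # Phase 1, O(n): each flow is placed in a single random multicast group.
    X = [[0] * k for _ in range(n)]
    for i in range(n):
        X[i][rng.randrange(k)] = 1

    # Phase 2, O(mnk): each user joins every group carrying a flow it wants.
    Y = [[0] * m for _ in range(k)]
    for h in range(m):
        for i in range(n):
            if W[i][h]:
                for j in range(k):
                    if X[i][j]:
                        Y[j][h] = 1
    return X, Y


# Toy instance (hypothetical): 4 flows, 3 users, 2 groups.
W = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
X, Y = random_assignment(W, k=2, seed=1)
print(X, Y)
```

By construction the result always satisfies the constraint $XY \ge W$, although users typically join groups that also carry flows they did not request.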

4.2 Second Step: Choosing Flows for Unicast 

Different ways of choosing flow-user pairs can be used. We con- 
centrated on two different types of heuristics: flow based and user 
based. Flow based heuristic means that each flow i is either sent 
to all users that are interested in it via unicast, or transmitted to all 
of them via multicast; one can either remove the "heaviest" flow 
or the "lightest" flow (in the sense of the amount of bandwidth re- 
quired to transmit that flow to all users interested in it). Clearly, we 
expect the lightest-flow approach to outperform the heaviest-flow 
approach; this has been validated by our simulations, and in the 
following graphs we will consider only the lightest-flow approach. 
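A minimal Python sketch of the non-iterative lightest-flow selection is given below. It assumes that the weight of a flow is its rate times the number of interested users, and that selection stops when a given unicast bandwidth budget would be exceeded; the function name, the `budget` parameter and the toy data are hypothetical.

```python
def lightest_flows_to_unicast(W, rates, budget):
    """Move whole flows (rows of W) to unicast, lightest first, within a budget."""
    n, m = len(W), len(W[0])
    # Bandwidth needed to unicast flow i to every user interested in it.
    load = [rates[i] * sum(W[i]) for i in range(n)]
    T = [[0] * m for _ in range(n)]
    W_rest = [row[:] for row in W]   # interests still to be served by multicast
    used = 0.0
    for i in sorted(range(n), key=lambda i: load[i]):
        if load[i] == 0:
            continue                 # nobody wants this flow
        if used + load[i] > budget:
            break                    # unicast bandwidth exhausted
        T[i] = W[i][:]
        W_rest[i] = [0] * m
        used += load[i]
    return T, W_rest


# Toy instance (hypothetical): flow loads are 1, 4 and 4.
W = [[1, 0], [1, 1], [0, 1]]
rates = [1.0, 2.0, 4.0]
T, W_rest = lightest_flows_to_unicast(W, rates, budget=5.0)
print(T, W_rest)
```

The remaining matrix `W_rest` is what the channelization step would then have to cover using multicast groups only.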

User-based heuristics mean that all flows sent to user h are sent
via unicast. That is, if user h receives any flow i using unicast,
then any other flow i' that is sent to h is also sent using
unicast. Similar to the case of flow removal, we can choose to
remove the "heaviest" or "lightest" user (here "heavy" and "light"
refer to the total bandwidth required to transmit all flows user h
is interested in). Our simulations show that the heaviest-user
approach outperforms the lightest-user approach; the reason lies in
the fact that heavy users listen to many multicast groups, and thus
incur a large filtering overhead. In the following graphs we show
the heaviest-user approach only.
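The user-based counterpart operates on columns of W instead of rows. The Python sketch below assumes that the weight of a user is the total rate of the flows it is interested in and, as an illustrative stopping rule, a unicast bandwidth budget; the function name, the `budget` parameter and the toy data are hypothetical.

```python
def heaviest_users_to_unicast(W, rates, budget):
    """Move whole users (columns of W) to unicast, heaviest first, within a budget."""
    n, m = len(W), len(W[0])
    # Bandwidth needed to unicast to user h every flow it is interested in.
    load = [sum(rates[i] * W[i][h] for i in range(n)) for h in range(m)]
    T = [[0] * m for _ in range(n)]
    W_rest = [row[:] for row in W]
    used = 0.0
    for h in sorted(range(m), key=lambda h: -load[h]):
        if load[h] == 0 or used + load[h] > budget:
            break
        for i in range(n):
            T[i][h] = W[i][h]
            W_rest[i][h] = 0
        used += load[h]
    return T, W_rest


# Toy instance (hypothetical): user loads are 3 and 6.
W = [[1, 0], [1, 1], [0, 1]]
rates = [1.0, 2.0, 4.0]
T, W_rest = heaviest_users_to_unicast(W, rates, budget=6.0)
print(T, W_rest)
```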

To sum up, we have tested the heuristics of removing the
heaviest/lightest flow/user from W, and moving it to T. In
addition, each of the above options was tested twice: once by
finding a single X, Y pair and then removing as many flows/users
from W as possible (termed "non-iterative"); and once by finding an
X, Y pair, removing a single flow/user from W, then finding a new
X, Y pair (that optimizes the modified W), removing another
flow/user from the altered W, and so on, as long as the cost
function decreased (termed "iterative"). Our simulations have shown
that the non-iterative approach performs almost as well as the
iterative one, with significantly reduced computational effort.
Thus, the following graphs depict only the non-iterative runs.

Figure 2: Initialization of the user interest matrix W and message
rate $\lambda$ according to the Market Distribution model.

In addition, we have tested several other heuristics. The basic
idea is to remove a flow/user in a greedy way, i.e., to repeatedly
move to unicast the user, flow, or flow-user pair that minimizes
the total cost (Eq. (1)), until the cost does not decrease or the
bandwidth for unicast is fully utilized (see footnote 1). We call
these heuristics greedy user, greedy flow and greedy flow-user
pair, respectively.
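The greedy flow variant can be sketched as follows. This minimal Python illustration assumes that the multicast mapping X, Y is kept fixed while flows are moved, that moving flow i clears row i of X and sets row i of T to row i of W, and that the gain of a move is the drop in the cost of Eq. (1); the function name, the `budget` parameter and the toy data are hypothetical.

```python
def greedy_flow(X, Y, W, rates, budget, w1=1.0, w2=1.0, w3=1.0):
    """Repeatedly move to unicast the flow whose move reduces Eq. (1) the most."""
    n, k, m = len(X), len(Y), len(W[0])
    X = [row[:] for row in X]                    # work on a copy
    T = [[0] * m for _ in range(n)]
    group_size = [sum(Y[j]) for j in range(k)]   # listeners per multicast group
    used = 0.0
    while True:
        best, best_gain = None, 0.0
        for i in range(n):
            if not any(X[i]):
                continue                         # flow is not multicast (any more)
            demand = rates[i] * sum(W[i])        # unicast bandwidth for flow i
            if used + demand > budget:
                continue
            multicast = rates[i] * sum(X[i][j] * (w1 * group_size[j] + w2)
                                       for j in range(k))
            unicast = w3 * (w1 + w2) * demand
            if multicast - unicast > best_gain:
                best, best_gain = i, multicast - unicast
        if best is None:
            return X, T                          # no move lowers the cost
        T[best] = W[best][:]
        X[best] = [0] * k
        used += rates[best] * sum(W[best])


# Toy instance (hypothetical): all three flows start in a single group.
W = [[1, 0], [1, 1], [0, 1]]
rates = [1.0, 2.0, 4.0]
X, T = greedy_flow([[1], [1], [1]], [[1, 1]], W, rates, budget=100.0)
print(X, T)
```

Here flow 2 is moved first (gain 4), then flow 0 (gain 1), while flow 1, which both users request, stays on multicast because unicasting it would raise the cost.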

In practice, the flow-user pair heuristic did not perform well, 
while having a high computational cost. Thus, it is not shown in 
the graphs. To sum up, we have tested in total eleven different 
heuristics. In the following section, we present the simulations' 
results for the best-performing among these heuristics. 

5. EXPERIMENTAL RESULTS 

We have experimented with three possible initializations of the
user interest matrix W. The first one is Random, where each user
uniformly selects 3% of the flows. The second one, Market
Distribution, is based on a model of subscription patterns in
financial messaging systems [Tock et al.(2005)]. This model is
based on stock market symbol rates collected from the New York
Stock Exchange (NYSE). The matrix W was composed of 10,000 symbols
divided into 10 markets, and 250 users. Each user was interested in
4 markets, and chose some of the symbols in each selected market.
The flows within a market are distributed exponentially, and the
markets are distributed using a Zipf distribution. The Market
Distribution determines the flow rate $\lambda$ as well.

Figure 2 shows an example of a user interest matrix (top), and
the relative message rate of each symbol (bottom), according to the
Market Distribution.

The third initialization of the matrix W uses a subscription
pattern captured from an IBM WebSphere [web(2008)] test cluster.
In it there are 79 processes subscribed to over 6100 topics.
Subscription to the topics is entirely automatic, influenced by the
configuration and load incurred upon the cell.

As can be seen in Figure 3, the resulting interest matrix is
clearly different from the one generated by the model of human
behavior in financial markets (see Figure 2). Importantly, many
topics have identical audiences, which lends itself perfectly to
multicast channelization.

5.1 Performance of the different algorithms 



Among the algorithms listed in Table 1, only the K-Means and the
interior-point method take the flow rates $\lambda$ into consideration.
Thus, only the K-means was plotted twice, once with equal rates
and once with rates derived from the Market Distribution, as shown
in Figure 4. Using equal rates, both K-means and Matlab K-means
have superior performance. However, using Market Distribution
rates, the K-Means algorithm has noticeably superior performance
over all others.

Figure 3: User interest matrix of an IBM WebSphere cell, with
automatic subscriptions to topics.

Figure 4: The non-hybrid channelization problem. Both K-means
algorithms perform best when the rates are equal.

Footnote 1: Without loss of generality, we assume there is a limit
on the total amount of bandwidth allocated for unicast. This limit
is used as a stopping criterion for our algorithm.

In all graphs shown, the Y axis represents the cost as a percentage
of perfect multicast, where the term perfect multicast refers to
the cost of transmission using multicast transport only (with no
unicast), assuming there is an unlimited number of multicast
groups. Thus, perfect multicast means that each user receives
exactly all flows it is interested in, each flow is transmitted
only once and there is zero filtering in the network.
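The normalization used on the Y axis can be written out as a short Python sketch. It assumes that under perfect multicast every requested flow is sent exactly once and received exactly by its interested users, and that flows nobody requests are not transmitted; the function names and the toy data are hypothetical.

```python
def perfect_multicast_cost(W, rates, w1=1.0, w2=1.0):
    """Cost with unlimited groups: no filtering, each requested flow sent once."""
    n = len(W)
    received = sum(rates[i] * sum(W[i]) for i in range(n))
    sent = sum(rates[i] for i in range(n) if any(W[i]))
    return w1 * received + w2 * sent


def percent_of_perfect(cost, W, rates, w1=1.0, w2=1.0):
    """Express a solution's cost as a percentage of the perfect multicast cost."""
    return 100.0 * cost / perfect_multicast_cost(W, rates, w1, w2)


# Toy instance (hypothetical): a solution of cost 21 on a 3-flow, 2-user system.
W = [[1, 0], [1, 1], [0, 1]]
rates = [1.0, 2.0, 4.0]
print(perfect_multicast_cost(W, rates), percent_of_perfect(21.0, W, rates))
```

A value of 100 therefore corresponds to a solution with no filtering and no duplicate transmissions; larger values measure the overhead relative to that reference.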

In the hybrid setting, we allow some of the traffic to be
transmitted using point-to-point connections. We have tested
different heuristics for moving traffic from multicast to unicast
(see Subsection 4.2).

Figure 5 compares the top heuristics: lightest-flow,
heaviest-user, greedy flow and greedy user. As can be seen,
allowing some of the data to be sent via unicast reduces the cost.
Evaluated using the Market Distribution, the greedy-user heuristic
outperforms the greedy-flow heuristic. However, this result is
reversed when evaluating using the WebSphere distribution (see
below). Thus, the relative competitiveness of these two heuristics
depends on the nature of the interest matrix.

Figure 5: Comparing different heuristics, with clear advantage
for the greedy-based algorithms.



5.2 Effect of the Interest Matrix W on performance

The interest matrix W represents the flows each user is interested
in. The performance of the different heuristics is highly dependent
on the content of W, which represents the characteristics of the
problem instance. In Figure 6 the lightest-flow heuristic is
evaluated with different interest matrices: a random interest
matrix where each flow has the same rate, a Market Distribution
interest matrix where all flows have the same fixed rate, and a
Market Distribution interest matrix where the rates also follow the
Market Distribution. As can be seen in Figure 6, the algorithm
performs best when running on a Market Distribution interest
matrix; i.e., the heuristic is well suited to the expected
distribution of a real-world financial market application. This
happens because of the underlying Zipf probability, where the top
flows are requested by a large number of users. This makes the
clustering of top flows into multicast groups easier.

Figure 7 shows how the different heuristics perform as the size of
the system increases. Each point in the figure represents a
different system: for point $i \in \{1, \ldots, 6\}$, the system consists
of $4000 + 1000 \cdot i$ flows and $50 \cdot i$ users, while the number of
multicast groups is fixed to 50. We did not scale the number of
multicast groups since it is usually dictated by the networking
hardware.
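For reference, the six system sizes implied by these formulas can be enumerated with a few lines of Python; the function name and the dictionary layout are hypothetical.

```python
def scaled_systems(points=range(1, 7), groups=50):
    """System sizes for each point i: 4000 + 1000*i flows, 50*i users, 50 groups."""
    return [{"point": i,
             "flows": 4000 + 1000 * i,
             "users": 50 * i,
             "groups": groups}
            for i in points]


systems = scaled_systems()
for s in systems:
    print(s)
```

The smallest system thus has 5000 flows and 50 users, and the largest has 10000 flows and 300 users, all sharing the same 50 multicast groups.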

The relation between the different heuristics is mostly preserved
at different system sizes. An interesting exception is point i = 2,
in which the greedy-flow heuristic outperforms the greedy-user
heuristic. This effect is not surprising, as different systems
(specifically, the ratio between flows, users and multicast groups)
can change the relative efficiency of the different heuristics.

To show the behavioral difference of the heuristics when running
on a mechanical subscription trace, we ran the different heuristics
on the IBM WebSphere distribution (see Figure 8). As can be seen,
when the subscription patterns closely overlap, the flow-based
heuristics outperform the user-based heuristics. It is interesting
to note that the heaviest-user heuristic actually increases the
cost, since this heuristic moves the heaviest user and does not
consider the cost of the move. In addition, the greedy user and
greedy flow heuristics reach their maximal improvement at very low
unicast bandwidth. This phenomenon is due to the structured nature
of the interest matrix, induced by the mechanical subscription
pattern.

Figure 6: Different interest matrices and their effect on
performance.

Figure 7: Effect of scaling on performance.

Figure 8: Comparing different heuristics on a trace of a WebSphere
cell.

Figure 9: System cost for the hybrid approach using the greedy
flow heuristic.

Figure 9 illustrates the benefits of our hybrid approach. The
greedy flow heuristic is forced to use a given percentage of
unicast bandwidth (the X axis), using the WebSphere subscription
pattern. Using the hybrid approach, the greedy flow heuristic
improves upon both the multicast-only and unicast-only schemes. The
total cost of transmission is reduced in a way which is not
possible using a single scheme.

5.3 Discussion 

We have experimented with different heuristics for selecting which
of the data should be transmitted using unicast. Under the stock
market model, the best heuristics are greedy heuristics which
repeatedly move a single user/flow from multicast to unicast to
minimize the total cost. In this setting, the distribution leads to
several multicast groups which carry a large number of
non-identical heavy flows. Thus, a user that is interested in any
heavy flow might be forced to receive it via a multicast group that
carries other heavy flows that he does not need, leading to a high
filtering cost. In this case, the gain of moving a single user to
unicast outweighs the gain that might be achieved by moving the
best flow to unicast, because the best flow to be moved is usually
fairly lightweight. Therefore, the heuristic of greedily moving
users from multicast to unicast works well in this setting.

The second scenario we tested consisted of a user interest matrix
from a WebSphere test cluster. As the users in this problem are
software or script based, their interests are homogeneous. Thus,
many users can use the same multicast group with no need for
filtering. Therefore, a flow which is of interest to only a few
users can incur a heavy filtering cost if it is assigned to a
multicast group that many users listen to. This property causes the
user-based heuristics to perform poorly, while the flow-based
heuristics perform well.

In other words, the greedy-user and greedy-flow schemes "thin out"
the interest matrix by removing rows and columns, respectively,
making the resulting interest matrix more amenable to
channelization. The relative competitiveness of these heuristics
depends on the structure of the interest matrix.

6. CONCLUSION 

This paper analyzes the hybrid channelization problem. We formally
define the problem as an optimization problem and propose efficient
heuristics for solving it. Our general framework starts from a
solution to the non-unicast problem and tries to improve it by
allowing some of the data to be transmitted via unicast.
Alternatively, one can start from a solution which utilizes only
unicast and improve it by allowing some of the data to be
transmitted via multicast.

We have tested our heuristics against two different real-world
scenarios. The first is a simulated brokers' interest in financial
market data, and the second is a mechanical subscription pattern
captured from an IBM WebSphere test cluster. Five different
algorithms for solving the non-unicast channelization problem were
examined, and a single algorithm, the K-means algorithm, was
identified to perform best in all settings.

In total we have experimented with eleven different heuristics.
The greedy heuristics (that improve the cost function directly)
performed better than the others. However, the results for greedy
heuristics should be taken with a grain of salt, as different
problems induce different distributions on the user interest matrix
W and on the rate of the flows. Thus, different heuristics may
perform differently as the problem context changes.

To conclude, by allowing a combination of multicast and unicast
transmissions, we reduce host and network resource consumption. It
seems that the performance of a publish-subscribe system is highly
dependent on the subscription patterns. Our hypothesis is that
user-based heuristics combined with flow-based heuristics cover a
large range of problems. Thus, we provide a range of heuristics
that can be used to practically deploy a publish-subscribe system
efficiently.